8366815: C2: Delay Mod/Div by constant transformation #27886

SirYwell · 2025-10-19T15:46:06Z

The test cases show examples of code where Value() previously wasn't run because idealization took place before, resulting in less precise type analysis.

Please let me know what you think.

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8366815: C2: Delay Mod/Div by constant transformation (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/27886/head:pull/27886
$ git checkout pull/27886

Update a local copy of the PR:
$ git checkout pull/27886
$ git pull https://git.openjdk.org/jdk.git pull/27886/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 27886

View PR using the GUI difftool:
$ git pr show -t 27886

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/27886.diff

Using Webrev

Link to Webrev Comment

bridgekeeper · 2025-10-19T15:47:43Z

👋 Welcome back hgreule! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2025-10-19T15:47:54Z

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

openjdk · 2025-10-19T15:48:41Z

@SirYwell The following label will be automatically applied to this pull request:

hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

mlbridge · 2025-10-19T17:36:37Z

Webrevs

01: Full - Incremental (6a8d842f)
00: Full (6a392224)

mhaessig

Thank you for working on this, @SirYwell. This seems like a tricky problem. To be honest, the fix seems a bit hacky. Have you explored any alternatives to this method of delaying the optimizations?

I kicked off some testing in the meantime that I will report back upon completion.

src/hotspot/share/opto/divnode.cpp

SirYwell · 2025-10-21T17:38:04Z

Thank you for working on this, @SirYwell. This seems like a tricky problem. To be honest, the fix seems a bit hacky. Have you explored any alternatives to this method of delaying the optimizations?

I kicked off some testing in the meantime that I will report back upon completion.

Thanks for running tests. I tried delaying until post loop opts, but that prevents some vectorization and isn't really less hacky I guess. I didn't find any other good existing approach. Calculating Value before Ideal would work, but I assume that it is rarely useful, with Div/Mod being an exception. I a dream world, I guess we would have e-graphs or something similar, which would allow calculating a more precise type from different alternatives.

If you can think of a better approach, please let me know.

mhaessig · 2025-10-22T15:18:44Z

Thinking about it a bit more, I think your fix is too superficial. If the discovery of the constant is slightly delayed, nothing is folded again. Consider the followig program for an example:

class Test {
    static boolean test(int x, boolean flag) {
        Integer a;
        if (flag) {
            a = 171384;
        } else {
            a = 2902;
        }

        return x % a >= a;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 20000; i++) {
            if (test(i, false)) {
                throw new RuntimeException("wrong result");
            }
        }
    }
}

In my opinion, the benefits do not outweigh the drawbacks for this PR. A better solution would probably be to delay the expansion of the Mod and Div nodes to post-loop optimizations and extend Superword to expand Div/Mod nodes to shifts. However, this is quite a bit of complexity, which raises if this complexity is worth it (@eme64 probably has opinions and/or guidance on this).

SirYwell · 2025-10-22T18:12:53Z

Thinking about it a bit more, I think your fix is too superficial. If the discovery of the constant is slightly delayed, nothing is folded again. Consider the followig program for an example:
class Test {
    static boolean test(int x, boolean flag) {
        Integer a;
        if (flag) {
            a = 171384;
        } else {
            a = 2902;
        }

        return x % a >= a;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 20000; i++) {
            if (test(i, false)) {
                throw new RuntimeException("wrong result");
            }
        }
    }
}
In my opinion, the benefits do not outweigh the drawbacks for this PR. A better solution would probably be to delay the expansion of the Mod and Div nodes to post-loop optimizations and extend Superword to expand Div/Mod nodes to shifts. However, this is quite a bit of complexity, which raises if this complexity is worth it (@eme64 probably has opinions and/or guidance on this).

I'm not sure about the drawbacks here, but I think optimizing this on the superword level doesn't make things less complicated.

If cases where we end up idealizing before calling Value are a more general problem, I'd say it's worth to also address it on exactly that level: make sure that Value is called before Ideal. I'm just hesitant because I'm not aware of any other situations where this matters. One middle ground here would be some kind of Node::InitialValue(...) (or a better name :) ) that just calls bottom_type() by default and can be overridden for nodes like Mod and Div. Joining that value with the Value calculated later would solve this problem on a different level, but more effectively. But it would also be a more invasive change overall. What it doesn't resolve is users of the node to see it before it gets replaced, which probably isn't that important right now, but can be hindering when e.g., trying to detect floorMod-like patterns.

eme64

We had a bit of an offline discussion in the office yesterday. Here a summary of my thoughts.

Ordering optimizations/phases in compilers is a difficult problem, it is not at all unique to this problem or even C2, all compilers have this problem.

Doing what @SirYwell does here, with delaying to IGVN is a relatively simple fix, and it at least addresses all cases where the divisor and the comparison are already parse time constants. I would consider that a win already. But the solution is a bit hacky.

The alternative that was suggested: delay it to post-loop-opts. But that is equally hacky really, it would have the same kind of delay logic where it is proposed now, just with a different "destination" (IGVN vs post-loop-opts). And it has the downside of preventing auto vectorization (SuperWord does not know how to deal with Div/Mod, no hardware I know of implements vectorized integer division, only floating division is supported). But delaying to post-loop-opts allows cases like @mhaessig showed, where control flow collapses during IGVN. We could also make a similar example where control flow collapses only during loop-opts, in some cases only after SuperWord even (though that would be very rare).

It is really difficult to handle all cases, and I don't know if we really need to. But it is hard to know which cases we should focus on.

Here a super intense solution that would be the most powerful I can think of right now:

Delay transform_int_divide to post-loop-opts, so we can wait for constants to appear during IGVN and loop-opts.
That would mean we have to accept regressions for the currently vectorizing cases, or we have to do some transform_int_divide inside SuperWord: add an VTransform::optimize pass somehow. This would take a "medium" amount of engineering, and it would be more C++ code to maintain and test.
Yet another possibility: during loop-opts, try to do transform_int_divide not just with constant divisor, but also loop-invariant divisor. We would have to find a way to do the logic of transform_int_divide that finds the magic constants in C2 IR instead of C++ code (there seem to be some "failure" cases in the computation, not sure if we can resolve those). If the loop has sufficient iterations, it can be profitable to do the magic constant calculation before the loop, and do only mul/shift/add inside the loop. But this seems like an optional add-on. But it would be really powerful. And it would make the VTransform::optimize (SuperWord) step unnecessary.

So my current thinking is:
We have to do some kind of delay anyway, either to IGVN or post-loop-opts, or elsewhere. For now, IGVN is a step in the right direction. The "delay mechanism" is a bit hacky, but we use it in multiple places already (grep for record_for_igvn). It is not @SirYwell 's fault that our delay mechanism is so hacky.

So I would vote for going with delay to IGVN for now, to at least support the parse-time constants. Then file some RFE that tracks the other ideas, and see if someone wants to pick that up (figure out a loop-opts pass that works for loop-invariant divisors, and otherwise delay to post-loop-opts).

eme64 · 2025-10-23T04:58:33Z

src/hotspot/share/opto/divnode.cpp

+  // Keep this node as-is for now; we want Value() and
+  // other optimizations checking for this node type to work


Do we only need Value done first on the Div node, or also on uses of it?
It might be worth explaining it in a bit more detail here.

If it was just about calling Value on the Div first, we could probably check what Value returns here. But I fear that is not enough, right? Because it is the Value here that returns some range, and then some use sees that this range has specific characteristics, and can constant fold a comparison, for example. Did I get this right?

So, the main reason why I'm including Div here is mainly because of #26143; before that the DivI/LNode::Value() is actually less precise than Value on the nodes created by transform_int_divide. With #26143, some results are more precise even for constant divisors. In such case, uses can benefit from seeing the (then) more precise range. (@ichttt found a case where the replacement fails to constant-fold, but that's just due to missing constant folding in MulHiLNode)

A secondary reason is other optimizations checking for Div inputs, though I didn't find any existing check that would actually benefit. There might be optimization opportunities that want to detect division, but that's just

Generally from what I've found the benefit is bigger for Mod nodes, because there calling Value on the replacements is significantly worse. And there we also encounter typical usages in combination with range checks.

Do you want me to expand both Div and Mod comments to cover more concrete benefits, depending on the operation?

Yes, I think it would make sense to have an explanation at both ends. Your nice example with the "rounding error" of 0..1 for Div makes a lot of sense. Seeing a similar example for Mod (where it could be worse, you say) would also be nice 😊

You can copy the comments for the I/L cases, or only put it at one of them, and link from the other. There is an issue with a PR that refactors mod/div so that we only have one implementation each, and they can clean this up.

SirYwell · 2025-10-23T07:36:10Z

Thanks for the summary @eme64. I totally agree that it's a bit hacky, but the current state is the least invasive. I'd also be interested in going further steps in the same direction, but I feel like the work increases significantly more than the benefits (at least as long as we don't generalize it to also optimize for loop invariant non-constants, but that's also a lot of work).

@mhaessig do you have test results already?

eme64 · 2025-10-23T08:44:19Z

Manuel and I discussed in the office a little more :)

Can you show us a concrete example, where Div gets Idealized early, and then the generated nodes do not propagate the value range sufficiently precise for the comparison to constant fold?

I suspect that it is the value range "truncation" on the lower bits that are lost in MulHiLNode, but it would be nice to see that example ;)

Because if there is a solution that just improves the Value of the mul/shift/... nodes, that would probably be preferable.

But if we in the end need to build a Value optimization that pattern matches again through the nodes that transform_int_divide generated, that would probably be less nice, given the complexity. And then we should do the delay.

SirYwell · 2025-10-23T11:22:16Z

Manuel and I discussed in the office a little more :)

Can you show us a concrete example, where Div gets Idealized early, and then the generated nodes do not propagate the value range sufficiently precise for the comparison to constant fold?

I suspect that it is the value range "truncation" on the lower bits that are lost in MulHiLNode, but it would be nice to see that example ;)

Because if there is a solution that just improves the Value of the mul/shift/... nodes, that would probably be preferable.

But if we in the end need to build a Value optimization that pattern matches again through the nodes that transform_int_divide generated, that would probably be less nice, given the complexity. And then we should do the delay.

One very straightforward example would be something like

static boolean divFold(int a) {
	return a / 100_000 >= 21475;
}

which isn't folded to false with early idealization but works with the changes from this PR and #26143 both applied.

From my analysis, this comes from the the rounding adjustments: We need to round towards zero, so we need to add 1 (=subtract -1) for negative values. We achieve that by an right shift to produce either a 0 or a -1 and then do the subtraction with that value.

The subtraction isn't aware of the relation between the param being negative and the adjustment, and as you said, to recognize that relation, you'd more or less need to recognize that these operations form a division.

Now, I think this is the only case, and it's only off by 1 (and if the sign of the dividend is known, it also isn't a problem), so I'm wondering if there are any common patterns where this would be relevant, otherwise it might really make sense to just delay Mod and accept this edge case for Div.

eme64 · 2025-10-23T11:39:13Z

Thanks very much for the explanation and the nice graph 😊

That helps a lot. It also means that even for cases like @mhaessig showed above:
#27886 (comment)

We could still constant fold the comparison.... as long as the comparison is "relaxed enough". It might be worth having a handfull of examples like that: some that still constant fold, and some that don't because the comparison is too "sharp", and the "rounding error" too large. What do you think?

SirYwell · 2025-10-23T15:05:47Z

We could still constant fold the comparison.... as long as the comparison is "relaxed enough". It might be worth having a handfull of examples like that: some that still constant fold, and some that don't because the comparison is too "sharp", and the "rounding error" too large. What do you think?

Do you mean as part of the comment? That should be doable and provide useful context, yes.

Edit: Mod is still harder to constant fold (or even get any precise information from) as we have an additional multiplication and subtraction there. So the example would probably still fail.

eme64 · 2025-10-23T15:18:43Z

Do you mean as part of the comment? That should be doable and provide useful context, yes.

Exactly, yes please 😊

iwanowww · 2025-10-23T19:27:05Z

It seems the root cause is a divergence in functionality in GVN between different representations of Div/Mod nodes.

How hard would it be to align the implementation and improve GVN handling for other representations, so the same type is constructed irrespective of the IR shape?

Alternatively, as part of the expansion, new representation can be wrapped in CastII/CastLL with the narrower type of original Div/Mod node.

eme64 · 2025-10-24T07:09:25Z

@iwanowww I'm not quite following your suggestions / questions.

It seems the root cause is a divergence in functionality in GVN between different representations of Div/Mod nodes.

How hard would it be to align the implementation and improve GVN handling for other representations, so the same type is constructed irrespective of the IR shape?

Do you consider the "expanded" versions of Div/Mod as a "different representation of Div/Mod"?
If yes: we could pattern match for such "expanded" versions of Div/Mod, but it would be quite complex: you would have to parse through patterns like displayed up here. Do you think that is a good idea? I already mentioned the idea up here, but did not think it was desirable due to the complexity.

Alternatively, as part of the expansion, new representation can be wrapped in CastII/CastLL with the narrower type of original Div/Mod node.

How does this "wrapping" help? After parsing, the CastII at the bottom of the "expanded" Div would just have the whole int range. How would the type of the CastII ever be improved, without pattern matching the "expanded" Div?

SirYwell · 2025-10-24T07:23:02Z

It seems the root cause is a divergence in functionality in GVN between different representations of Div/Mod nodes.

How hard would it be to align the implementation and improve GVN handling for other representations, so the same type is constructed irrespective of the IR shape?

For Div, we can have either the magic multiply/shift variant, or shifting for powers of 2. Each of these can have slightly different shapes depending on e.g., the sign of the divisor, the sign of the dividend, and the sign of the magic constant.
For DivL, we also use MulHi if supported and a different sequence of instructions otherwise.

For Mod, we either do what we do for Div, multiply again and subtract to get the remainder; or we just directly use And, or we do the Mersenne number optimization (related: https://bugs.openjdk.org/browse/JDK-8370135) which unrolls the same few operations multiple times.

Generally, there also isn't one guaranteed result node (like e.g., Sub) where we could place code that recognizes these patterns and provides better results, so I don't think this is feasible (for Div it might be doable, at least I only found that off-by-one overapproximation that could be dealt with in Sub).

Alternatively, as part of the expansion, new representation can be wrapped in CastII/CastLL with the narrower type of original Div/Mod node.

I thought about that a bit as well and I think it has the same downside as the current approach: As soon as we don't use the Div/Mod node anymore, making the inputs more precise doesn't help anymore. We still have the Cast node, but that node doesn't know how to recalculate/improve its own type.
(Additionally, but less of a problem, a Cast node would require optimizations checking their inputs for Div/Mod nodes to uncast).

Basically, this comes back to what e-graphs do better: remember multiple alternative constructs for the same semantic operation. Without considering whether that's realistic, if a Cast node would keep the original operation alive somehow (but that operation isn't further optimized itself, I guess), then the Cast node could recalculate its type depending on multiple variants and choose the more specific result even at later stages of optimization.

That said, I'm not in the position if using Cast nodes is more idiomatic, and I'm open to rework the PR to use Cast nodes if you want.

eme64 · 2025-10-24T07:44:36Z

@SirYwell We were probably typing at the same time ;)

SirYwell · 2025-10-24T09:07:59Z

@SirYwell We were probably typing at the same time ;)

Indeed :)

I still updated the comments now as they are useful independently of the ongoing discussion. Feel free to give your opinion on that :)

I also noticed that x % 2 == 0 isn't optimized to (x & 1) == 0, which is a bit surprising. I found https://bugs.openjdk.org/browse/JDK-8210152 but that seems to be different, so maybe there is a regression somewhere if that worked before?

iwanowww · 2025-10-24T18:04:57Z

Do you consider the "expanded" versions of Div/Mod as a "different representation of Div/Mod"?

Yes, exactly.

we could pattern match for such "expanded" versions of Div/Mod, but it would be quite complex: you would have to parse through patterns like displayed #27886 (comment). Do you think that is a good idea?

Sure, It may be way above the complexity budget we are willing to spend on it. The expansion code I see for Div/Mod nodes doesn't look too complicated, but matching the pattern may require more effort. The positive thing is it'll optimize the pattern irrespective of the origin (either expanded Div/Mod or explicitly optimized in the code by the user). So, the question is how much complexity it requires vs scenarios it covers.

How does this "wrapping" help? After parsing, the CastII at the bottom of the "expanded" Div would just have the whole int range. How would the type of the CastII ever be improved, without pattern matching the "expanded" Div?

It's not fully clear to me what is the scope of problematic scenarios. If it's only about Ideal() expanding the node before Value() has a chance to run, then wrapping the result of expansion in CastII/CastLL node and attach Value() as it's type should be enough (when produced type is narrower than Type::INT).

If we want to to keep expanded shape while being able to compute its type as if it were the original node, then a new flavor of Cast node may help. The one which keeps the node type and its inputs and can run Value() as if it were the original node.

eme64 · 2025-10-27T06:49:08Z

@iwanowww I see, so we could implement something like a CastII with multiple inputs, which we know must all be identical at runtime. The first input is the one we will in the end pick. But during Value, we take the intersection of all input ranges. So if another (not the first input) has a narrower type, we can use that type. I suppose that would be feasible. Do you have a good name for such a node?

What I don't know: how does that interact with other IGVN optimizations, especially those that want to pattern match specific nodes? Inserting such special cast nodes could interrupt Ideal optimizations, current pattern matching would not know how to deal with it. Probably it is not a big issue, but I'm not sure.

SirYwell added 2 commits October 19, 2025 17:41

test

98ae73c

delay integral Div/Mod Ideal() until IGVN

6a39222

openjdk bot changed the title ~~8366815~~ 8366815: C2: Delay Mod/Div by constant transformation Oct 19, 2025

openjdk bot added the hotspot-compiler [email protected] label Oct 19, 2025

SirYwell marked this pull request as ready for review October 19, 2025 17:31

openjdk bot added the rfr Pull request is ready for review label Oct 19, 2025

mhaessig reviewed Oct 21, 2025

View reviewed changes

src/hotspot/share/opto/divnode.cpp Outdated Show resolved Hide resolved

eme64 reviewed Oct 23, 2025

View reviewed changes

expand comments

6a8d842

		// Keep this node as-is for now; we want Value() and
		// other optimizations checking for this node type to work

Uh oh!

8366815: C2: Delay Mod/Div by constant transformation #27886

Are you sure you want to change the base?

8366815: C2: Delay Mod/Div by constant transformation #27886

Conversation

SirYwell commented Oct 19, 2025 • edited by openjdk bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Progress

Issue

Reviewing

Uh oh!

bridgekeeper bot commented Oct 19, 2025

Uh oh!

openjdk bot commented Oct 19, 2025

Uh oh!

openjdk bot commented Oct 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mlbridge bot commented Oct 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Webrevs

Uh oh!

mhaessig left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

SirYwell commented Oct 21, 2025

Uh oh!

mhaessig commented Oct 22, 2025

Uh oh!

SirYwell commented Oct 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eme64 left a comment

Choose a reason for hiding this comment

Uh oh!

eme64 Oct 23, 2025

Choose a reason for hiding this comment

Uh oh!

SirYwell Oct 23, 2025

Choose a reason for hiding this comment

Uh oh!

eme64 Oct 23, 2025

Choose a reason for hiding this comment

Uh oh!

SirYwell commented Oct 23, 2025

Uh oh!

eme64 commented Oct 23, 2025

Uh oh!

SirYwell commented Oct 23, 2025

Uh oh!

eme64 commented Oct 23, 2025

Uh oh!

SirYwell commented Oct 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eme64 commented Oct 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

iwanowww commented Oct 23, 2025

Uh oh!

eme64 commented Oct 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SirYwell commented Oct 24, 2025

Uh oh!

eme64 commented Oct 24, 2025

Uh oh!

SirYwell commented Oct 24, 2025

Uh oh!

iwanowww commented Oct 24, 2025

Uh oh!

eme64 commented Oct 27, 2025

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

SirYwell commented Oct 19, 2025 •

edited by openjdk bot

Loading

openjdk bot commented Oct 19, 2025 •

edited

Loading

mlbridge bot commented Oct 19, 2025 •

edited

Loading

SirYwell commented Oct 22, 2025 •

edited

Loading

SirYwell commented Oct 23, 2025 •

edited

Loading

eme64 commented Oct 23, 2025 •

edited

Loading

eme64 commented Oct 24, 2025 •

edited

Loading